A powerful subset-based method identifies gene set associations and improves interpretation in UK Biobank

Abstract

Tests of association between a phenotype and a set of genes in a biological pathway can provide insights into the genetic architecture of complex phenotypes beyond those obtained from single-variant or single-gene association analysis. However, most existing gene set tests have limited power to detect gene set-phenotype association when only a small fraction of the genes are associated with the phenotype, and they cannot identify the potentially "active" genes that might drive a gene set-based association. To address these issues, we developed Gene set analysis Association Using Sparse Signals (GAUSS), a method for gene set association analysis that requires only GWAS summary statistics. For each significantly associated gene set, GAUSS identifies the subset of genes that have the maximal evidence of association and can best account for the gene set association. Using pre-computed correlation structure among test statistics from a reference panel, our p value calculation is substantially faster than other permutation- or simulation-based approaches. In simulations with varying proportions of causal genes, we find that GAUSS effectively controls type 1 error rate and has greater power than several existing methods, particularly when a small proportion of genes account for the gene set signal. Using GAUSS, we analyzed UK Biobank GWAS summary statistics for 10,679 gene sets and 1,403 binary phenotypes. We found that GAUSS is scalable and identified 13,466 phenotype and gene set association pairs. Within these gene sets, we identify an average of 17.2 (max = 405) genes that underlie these gene set associations.

Introduction

Over the last fifteen years, genome-wide association studies (GWASs) have identified thousands of genetic variants associated with hundreds of complex diseases and phenotypes.[1] To date, the associated variants, individually or collectively, typically account for only a small proportion of the heritability.[2] A possible explanation is that, because of the large number of genetic polymorphisms examined in GWASs and the massive number of tests conducted, many weak associations are missed after multiple comparison adjustments.[3]
Gene set analysis (GSA) can identify associations that may not be detected by single-variant or single-gene analysis, especially for rare variants or variants with weak to moderate effects.[4] In GSA, individual genes are aggregated into groups sharing certain biological or functional characteristics. This approach considerably reduces the number of tests performed, since the number of gene sets is much smaller than the number of variants or genes tested.[5,6] Additionally, most complex phenotypes are manifested through the combined activity of multiple genes or variants, so GSA can provide insight into the involvement of specific biological pathways or cellular mechanisms in the phenotype.[7]

GSA aims to draw conclusions regarding one of two types of null hypotheses:[6] (1) the competitive null hypothesis, in which the genes in a gene set of interest are no more associated with the phenotype than any genes outside it, and (2) the self-contained null hypothesis, in which none of the genes in the gene set of interest is associated with the phenotype. Several methods to perform GSA have been developed and successfully applied to complex diseases.[8–15]
For example, de Leeuw et al.[13] developed MAGMA, which transforms gene-based p values into Z values using an inverse normal transformation and employs linear regression to test for association. Pan et al.[12] developed aSPUpath, which uses an adaptive test statistic based on the sum of powered scores and calculates a permutation-based p value.

However, there are concerns regarding the power, type I error control, and computational scalability of these methods. Existing methods often have relatively low power,[9] especially in situations where only a few genes within the gene set are associated with the phenotype.[14] In the presence of correlation among variants or genes due to linkage disequilibrium (LD), some methods might not appropriately control type I error.[16] Resampling-based strategies can be used for p value calculation,[17] but in their current implementation they are computationally very expensive, reducing the applicability of the method, especially for large datasets. Although identifying the genes that possibly drive the association signal is important for further downstream analysis, existing methods fail to identify such genes.

Here, we describe GAUSS, a computationally efficient subset-based gene set association method that aims to increase power over existing methods while maintaining proper type I error control, and to facilitate interpretation by extracting the subset of genes that drive the association. GAUSS focuses on the self-contained null hypothesis, as our main goal is to identify phenotype-associated genes or loci. GAUSS identifies the subset of genes (called the core subset) that produce the maximum signal of association. The p value is calculated through a fast copula-based simulation approach combined with an approximation based on the generalized Pareto distribution.[18,19]
The gene-based p values used by GAUSS can be computed directly from individual-level genotype data if available, or approximated from GWAS summary statistics (effect sizes, standard errors, and minor allele frequency). The use of pre-computed correlation matrices makes GAUSS applicable to biobank-scale datasets.

Through computer simulation, we show that GAUSS is more powerful than existing methods while maintaining the correct type I error. We applied GAUSS to UK Biobank GWAS summary statistics for 1,403 phenotypes[20] with 10,679 gene sets derived from the molecular signature database (MSigDB v.6.2),[21] demonstrating that GAUSS is feasible for large-scale data and can provide new insights. We have made the results publicly available through a visual browser.

Material and methods

To conduct the GAUSS test, we need gene-based p values for the genes or regions in the gene set. Popular gene-based tests, including SKAT,[22] SKAT-Common-Rare,[23] and the expression-based prediXcan[24] and TWAS-FUSION,[25] can be used to obtain these p values when individual-level data are available. If only GWAS summary statistics (effect size, standard error, p value, and minor allele frequency for each variant) are available, the gene-based tests can be approximated using LD information from a suitable reference panel (see Appendix A).[26] GAUSS is then given by the following steps.

Step 1: Test statistic

To construct the GAUSS test statistic, we start with the gene-based p values for the $m$ genes in the gene set $H$. Suppose $\mathrm{Pvalue}_i$ is the p value for the $i$th gene ($i = 1, \ldots, m$). We first convert each p value to a z score, $z_i = \Phi^{-1}(1 - \mathrm{Pvalue}_i)$, where $\Phi^{-1}$ is the inverse of the standard normal cumulative distribution function.
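The conversion from gene-based p values to z scores can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `pvalue_to_z` and the gene labels are hypothetical and are not taken from the GAUSS software.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def pvalue_to_z(p):
    """Convert a gene-based p value to the z score Phi^{-1}(1 - p)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p value must lie strictly between 0 and 1")
    # By symmetry Phi^{-1}(1 - p) = -Phi^{-1}(p). The second form avoids
    # rounding 1 - p to exactly 1.0 when p is very small.
    return -_STD_NORMAL.inv_cdf(p)


# Hypothetical gene-based p values, for illustration only.
gene_pvalues = {"GENE_A": 0.5, "GENE_B": 0.025, "GENE_C": 1e-8}
z_scores = {gene: pvalue_to_z(p) for gene, p in gene_pvalues.items()}
```

Small p values map to large positive z scores, so genes with stronger evidence of association contribute more to any subset they belong to.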
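The gene-based p values can be, for example, p values of SKAT or SKAT-Common-Rare test statistics, or prediXcan p values.

For any non-empty subset $B \subseteq H$ of the gene set $H$, we define the association score of $B$ as

$$S(B) = \frac{\sum_{i \in B} z_i}{\sqrt{|B|}},$$

where $|B|$ is the number of genes in $B$. Dividing by $\sqrt{|B|}$ gives $S(B)$ unit variance when the z scores are independent standard normal, so scores of subsets of different sizes are comparable. The GAUSS statistic for $H$ is

$$\mathrm{GAUSS}(H) = \max_{B \subseteq H} \frac{\sum_{i \in B} z_i}{\sqrt{|B|}}.$$

Although there are overall $2^m - 1$ non-empty subsets of $H$, the computational complexity can be greatly reduced by rewriting the formula as

$$\mathrm{GAUSS}(H) = \max_{k \in \{1, \ldots, m\}} \; \max_{B_k \subseteq H} \frac{\sum_{i \in B_k} z_i}{\sqrt{|B_k|}},$$

where $B_k$ denotes a subset of $H$ with $k$ elements. Because the denominator depends only on $k$, the inner maximum is attained by the $k$ largest z scores, and it is easy to show that

$$\max_{B_k \subseteq H} \frac{\sum_{i \in B_k} z_i}{\sqrt{|B_k|}} = \frac{z_{(1)} + z_{(2)} + \cdots + z_{(k)}}{\sqrt{k}}, \qquad \text{(Equation 1)}$$

where $z_{(1)}, z_{(2)}, \ldots, z_{(m)}$ are the z scores ordered in decreasing order, with $z_{(1)}$ the maximum. Equation 1 holds regardless of the joint distribution of the z scores (for a detailed proof see supplemental section A). We implement the following algorithm to compute the GAUSS statistic:

1. order the z scores as $z_{(1)} \ge z_{(2)} \ge \cdots \ge z_{(m)}$;
2. starting with $k = 1$, compute $S_k = (z_{(1)} + z_{(2)} + \cdots + z_{(k)})/\sqrt{k}$ for all $k = 1, 2, \ldots, m$;
3. calculate $\mathrm{GAUSS}(H) = \max_{k \in \{1, \ldots, m\}} S_k$.

Using this approach, the computational cost is reduced from $O(2^m)$ to $O(m \log m)$. We term the subset $B$ at which the maximum is attained the core subset (CS) of $H$.

Step 2: p value calculation

Because genes in the same region can be correlated due to LD, the z scores from step 1 can be dependent. Thus, it is challenging to derive the null distribution of the GAUSS statistic analytically. Instead, we employ a fast simulation approach. We first estimate the correlation structure ($\hat{V}_H$) of $(z_1, z_2, \ldots, z_m)$ under the null hypothesis. $\hat{V}_H$ can be estimated from the sample itself or from an ancestry-matched reference panel. Here we use individuals from the 1000 Genomes data.[27] We note that $\hat{V}_H$ needs to be estimated only once for a dataset and can be reused across iterations. With $\hat{V}_H$, the joint null distribution of the z scores is approximated by a multivariate normal distribution (see sections C and D). The null distribution of the GAUSS statistic can now be simulated by repeatedly generating z scores from a multivariate normal distribution with mean zero and covariance $\hat{V}_H$, calculating the GAUSS statistic for each draw, and comparing it with the observed statistic (see the supplement for details). To reduce computational cost, we use a resampling scheme (see section B). To estimate very small p values (e.g., $< 5 \times 10^{-6}$), we use a generalized Pareto distribution (GPD)-based method:[18] we fit a GPD to the upper tail of the simulated statistics using the right-tailed second-order Anderson-Darling statistic (GPD-AD2R) (see Figure S1) and obtain the p value by inverting the distribution function of the fitted GPD.

Simulation studies

We carried out simulation studies to evaluate the performance of GAUSS using genotype data from UK Biobank. To have realistic LD patterns in the generative model, we used the genotypes of 5,000 unrelated UK Biobank participants throughout the simulations. To understand the performance of GAUSS across gene sets of different sizes, we selected three gene sets of different length from the GO terms in MSigDB (v.6.2) for the simulations: regulation of blood volume by renin angiotensin (GO: 0002016; 11 genes), sterol metabolic process (GO: 0016125; 123 genes), and immune response (GO: 0006955; 1,100 genes).

We defined a gene as active if at least one variant annotated to the gene had a non-zero effect size. We randomly selected $g_a$ genes to be active and, for the $l$th active gene, which contains $t_l$ variants, set a proportion $v_{a;l}$ of its variants to have non-zero effects.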
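As a concrete illustration of Steps 1 and 2 of the method, the sketch below computes the subset statistic and core subset by sorting, then estimates a p value by plain Monte Carlo simulation from a multivariate normal null. It assumes the subset score divides the sum of z scores by the square root of the subset size. It is not the authors' implementation: the resampling scheme and the GPD tail approximation are omitted, and the function names, the example z scores, and the identity correlation matrix are all hypothetical.

```python
import math
import random


def gauss_statistic(z):
    """Return (statistic, core subset) for a list of z scores.

    Assumes S_k = (sum of the k largest z scores) / sqrt(k); the core
    subset is returned as sorted indices into z.
    """
    order = sorted(range(len(z)), key=lambda i: z[i], reverse=True)
    best, best_k, running = -math.inf, 0, 0.0
    for k, idx in enumerate(order, start=1):
        running += z[idx]
        s_k = running / math.sqrt(k)
        if s_k > best:
            best, best_k = s_k, k
    return best, sorted(order[:best_k])


def cholesky(v):
    """Lower-triangular L with L L^T = v, for a positive-definite matrix v."""
    m = len(v)
    low = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = sum(low[i][t] * low[j][t] for t in range(j))
            if i == j:
                low[i][j] = math.sqrt(v[i][i] - s)
            else:
                low[i][j] = (v[i][j] - s) / low[j][j]
    return low


def monte_carlo_pvalue(z_obs, v_hat, n_sim=2000, seed=0):
    """Plain Monte Carlo p value under z ~ N(0, v_hat); no GPD tail step."""
    rng = random.Random(seed)
    low = cholesky(v_hat)
    m = len(z_obs)
    observed, _ = gauss_statistic(z_obs)
    exceed = 0
    for _ in range(n_sim):
        e = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # Correlate the independent draws: z_null = L e.
        z_null = [sum(low[i][t] * e[t] for t in range(i + 1)) for i in range(m)]
        if gauss_statistic(z_null)[0] >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_sim + 1)


# Hypothetical example: four genes, assumed uncorrelated under the null.
z_example = [3.0, 2.5, -0.5, 0.1]
v_example = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
statistic, core_subset = gauss_statistic(z_example)
p_value = monte_carlo_pvalue(z_example, v_example)
```

In this example the two largest z scores form the core subset, because adding the third gene lowers the normalized sum. A plain Monte Carlo estimate cannot resolve p values much below 1/n_sim, which is why a tail approximation is needed for very small p values.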
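For $N$ individuals from UK Biobank, we generated the phenotype of individual $i$ ($i = 1, \ldots, N$) according to the model

$$Y_i = \sum_{k=1}^{T} \beta_k G_{ik} + \varepsilon_i,$$

where $\varepsilon_i \sim N(0, 1)$, $G_{ik}$ is the genotype of individual $i$ at the $k$th causal variant, and $T = \sum_{l=1}^{g_a} t_l v_{a;l}$ is the total number of causal variants. Throughout the simulations, $N = 5{,}000$. The effect size of the $k$th causal variant, with minor allele frequency $\mathrm{MAF}_k$, was generated as $\beta_k = c\,|\log_{10}(\mathrm{MAF}_k)|$, where $c$ is the magnitude of association: $c = 0$ under the null hypothesis and $c > 0$ under the alternative. The value of $c$ was determined by fixing the heritability explained by the gene set ($h^2_{gs}$). We varied $h^2_{gs}$ from 1% to 10%. With 20%–30% of variants having non-zero effects, the corresponding $c$ varied approximately from 0.10 to 0.25.

Given the phenotypes generated as above, we used Europeans from the 1000 Genomes data (sample size 498) to extract the correlation structure among genes.

UK Biobank analysis

We used UK Biobank GWAS summary statistics that were generated by SAIGE[28] and are available through PheWeb (see the PheWeb entry in web resources). The summary statistic files included markers that were directly genotyped or imputed with the Haplotype Reference Consortium (HRC) panel, which produced approximately 28 million markers with MAC ≥ 20 and an imputation info threshold of 0.3. We annotated variants to genes using EPACTS[29] (see web resources) with RefSeq gene annotation. For each gene, we included non-synonymous variants, variants within a kb-scale window of an exon, and regulatory variants. We extracted the LD matrix using emeraLD. For each of 18,334 genes, gene-based p values were approximated from the SAIGE estimates of effect size (b), standard error (SE), and minor allele frequency (MAF). The gene-based p values were then transformed to z scores for each phenotype.

Results

Simulation results

Type I error rates for the three gene sets

To estimate type I error rates, we generated a normally distributed phenotype (see simulation studies), independent of genotypes. We then computed gene-based p values and subsequently applied the GAUSS test. The type I errors of GAUSS remained well calibrated at $\alpha = 1 \times 10^{-4}$, $1 \times 10^{-5}$, and $5 \times 10^{-6}$ (Table 1) for all three gene sets.

Table 1. Estimated type I error rates of GAUSS for the gene sets GO: 0016125, GO: 0006955, and GO: 0002016

| α | GO: 0016125 (123 genes) | GO: 0006955 (1,100 genes) | GO: 0002016 (11 genes) |
| 1 × 10^-4 | 9.8 × 10^-5 | 9.8 × 10^-5 | 9.7 × 10^-5 |
| 1 × 10^-5 | 9.9 × 10^-6 | 9.3 × 10^-6 | 9.6 × 10^-6 |
| 5 × 10^-6 | 4.6 × 10^-6 | 4.8 × 10^-6 | 4.2 × 10^-6 |

Next, we compared the power of GAUSS to detect a gene set association, under a spectrum of association models, with three existing methods: SKAT applied to all variants in the gene set (SKAT-Pathway), MAGMA, and aSPUpath. We first considered a scenario in which a relatively large fraction of the genes (16.2%) were active, with 30% of the variants in each active gene being causal. The gene set heritability ($h^2_{gs}$) was varied from 1% to 6%. The empirical power of all methods increased with increasing $h^2_{gs}$. GAUSS and MAGMA had similar power (Figure 1; left panel) across these scenarios, and SKAT-Pathway had the lowest power. aSPUpath had slightly lower power than GAUSS for $h^2_{gs}$ of 1%–3%; for $h^2_{gs}$ of 4%–6% the two methods were closer.

Next, we considered scenarios in which the signals were sparser (Figure 1; right panel), i.e., only a small number of genes, between 1.6% and 5.0% (six genes) of the gene set, were active. The gene set heritability was fixed at ~3%. In these settings, GAUSS was the most powerful method, with a clear gap to the other methods. The same trend held for the other gene sets, including the larger one (Figures S2 and S3).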
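The generative model used in the simulation studies, $Y_i = \sum_k \beta_k G_{ik} + \varepsilon_i$ with $\beta_k = c\,|\log_{10}(\mathrm{MAF}_k)|$, can be sketched as follows. One loudly stated assumption: the study drew genotypes from real UK Biobank participants to preserve LD, whereas this sketch draws independent allele counts from Binomial(2, MAF), so it carries no LD. The function names and the example MAFs are hypothetical.

```python
import math
import random


def simulate_genotypes(n, mafs, seed=1):
    """Draw allele counts (0, 1, or 2) for n individuals at each causal variant.

    Assumption: variants are independent, i.e. no LD, unlike real genotypes.
    """
    rng = random.Random(seed)
    return [
        [(rng.random() < maf) + (rng.random() < maf) for maf in mafs]
        for _ in range(n)
    ]


def simulate_phenotypes(genotypes, mafs, c, seed=0):
    """Generate Y_i = sum_k beta_k * G_ik + eps_i with eps_i ~ N(0, 1).

    Effect sizes follow beta_k = c * |log10(MAF_k)|, so rarer variants get
    larger effects; c = 0 gives a phenotype independent of genotype.
    """
    rng = random.Random(seed)
    betas = [c * abs(math.log10(maf)) for maf in mafs]
    return [
        sum(b * g for b, g in zip(betas, row)) + rng.gauss(0.0, 1.0)
        for row in genotypes
    ]


# Hypothetical example: three causal variants, 5,000 individuals.
example_mafs = [0.01, 0.05, 0.2]
example_genotypes = simulate_genotypes(5000, example_mafs)
null_phenotypes = simulate_phenotypes(example_genotypes, example_mafs, c=0.0)
alt_phenotypes = simulate_phenotypes(example_genotypes, example_mafs, c=0.15)
```

Setting `c=0.0` reproduces the null setting used for type I error estimation, and values of `c` in the range of roughly 0.10 to 0.25 correspond to the alternative settings described above.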
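Identification of active genes

We investigated the sensitivity and specificity of GAUSS in identifying the active genes. Sensitivity is defined as the proportion of active genes correctly included in the CS, and specificity as the proportion of inactive genes correctly excluded from the CS. Because existing methods do not attempt to identify active genes, we compared GAUSS with a heuristic approach of defining the significant genes ($p < 2.5 \times 10^{-6}$) as active. With GAUSS, both sensitivity and specificity were higher (>75%) across the different gene sets (Figure 2). We also evaluated exact identification of the active genes, a more stringent criterion than sensitivity and specificity. Results under different magnitudes of effect sizes and heritability, including the probability of exact identification via GAUSS, are shown in Figures S4 and S5.

Figure 2. Sensitivity, specificity, and identification of non-null genes with GAUSS (panels A–C; horizontal axis: active genes; solid and dashed lines).

The simulation results highlight the utility of GAUSS, especially when a gene set is only weakly associated with the phenotype. Further, the high sensitivity and specificity of the CS provide a direct way to interpret the findings.

Association analysis in UK Biobank

We applied GAUSS to the UK Biobank GWAS summary statistics for 1,403 disease-related phenotypes[28] (see material and methods). We used gene sets from two MSigDB collections: curated gene sets (C2), drawn from the KEGG, BioCarta, and Reactome databases together with gene sets representing expression signatures of genetic and chemical perturbations, and gene sets that contain genes annotated by the same GO term (C5). Gene-based (SKAT-Common-Rare) p values were used as input, and the output for each phenotype-gene set pair consists of the GAUSS p value and, if the pair is reported significant, the CS. A Bonferroni corrected ...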


Similar resources

An atlas of genetic associations in UK Biobank

Genome-wide association studies have revealed many loci contributing to the variation of complex traits, yet the majority of loci that contribute to the heritability of complex traits remain elusive. Large study populations with sufficient statistical power are required to detect the small effect sizes of the yet unidentified genetic variants. However, the analysis of huge cohorts, like UK Biob...

Enrichment Map: A Network-Based Method for Gene-Set Enrichment Visualization and Interpretation

BACKGROUND Gene-set enrichment analysis is a useful technique to help functionally characterize large gene lists, such as the results of gene expression experiments. This technique finds functionally coherent gene-sets, such as pathways, that are statistically over-represented in a given gene list. Ideally, the number of resulting sets is smaller than the number of genes in the list, thus simpl...

Gene–obesogenic environment interactions in the UK Biobank study

Background Previous studies have suggested that modern obesogenic environments accentuate the genetic risk of obesity. However, these studies have proven controversial as to which, if any, measures of the environment accentuate genetic susceptibility to high body mass index (BMI). Methods We used up to 120 000 adults from the UK Biobank study to test the hypothesis that high-risk obesogenic e...

Task-based language teaching in Iran: a mixed study through constructing and validating a new questionnaire based on theoretical, sociocultural, and educational frameworks

Many aspects of life in Iran, including lifestyle, science, and technical and technological facilities, can be considered more or less imported. The English language and its teaching methods are no exception to this rule. Nevertheless, the question sometimes arises whether a particular method is compatible with the theoretical, sociocultural, and educational infrastructure of Iranian society. This study was conducted using mixed methods. A questionnaire was also ... for language learners ...

Associations between single and multiple cardiometabolic diseases and cognitive abilities in 474 129 UK Biobank participants

Aims Cardiometabolic diseases (hypertension, coronary artery disease [CAD] and diabetes are known to associate with poorer cognitive ability but there are limited data on whether having more than one of these conditions is associated with additive effects. We aimed to quantify the magnitude of their associations with non-demented cognitive abilities and determine the extent to which these assoc...


Journal

Journal title: American Journal of Human Genetics

Year: 2021

ISSN: ['0002-9297', '1537-6605']

DOI: https://doi.org/10.1016/j.ajhg.2021.02.016